The data contains features extracted from the silhouettes of vehicles viewed from different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.
The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import zscore
from sklearn.model_selection import KFold, cross_val_score
cData = pd.read_csv("vehicle.csv")
cData.shape
cData.head()
datatype = cData.dtypes
print(datatype)
cData.info()
cData.describe().transpose()
There is significant skewness in the variables radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance, scaled_variance.1, scaled_radius_of_gyration.1, skewness_about and skewness_about.1.
Outliers in these variables are prominently visible from the five-number summary.
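To make the skewness and outlier claims concrete, here is a minimal sketch of how both can be quantified for a single feature. It uses synthetic exponential data as a stand-in, since the actual vehicle.csv columns are not loaded here; the column name is borrowed for illustration only:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one right-skewed feature (not the real vehicle data)
rng = np.random.default_rng(0)
feature = pd.Series(rng.exponential(scale=2.0, size=500), name="radius_ratio")

# Skewness: values well above 0 indicate a long right tail
skew = feature.skew()

# IQR rule from the five-number summary: points beyond 1.5*IQR are outliers
q1, q3 = feature.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = feature[(feature < q1 - 1.5 * iqr) | (feature > q3 + 1.5 * iqr)]
print(f"skewness = {skew:.2f}, outliers flagged = {len(outliers)}")
```

The same two checks, applied column by column, reproduce the observations above.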
print(cData.groupby('class').size())
cData.isnull().values.any()
# There are missing values. Instead of dropping those rows, let's replace the missing values with the column median.
cData.median(numeric_only=True)
# every numeric column's missing values are replaced with that column's median
# (numeric_only=True skips the string-valued 'class' column)
cData = cData.fillna(cData.median(numeric_only=True))
cData.isnull().values.any()
Now there are no missing values.
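As an alternative to fillna, scikit-learn's SimpleImputer performs the same median imputation and slots directly into a pipeline later on; a minimal sketch on a toy matrix (not the vehicle data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing entry per column
X_toy = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [np.nan, 6.0]])

# strategy='median' mirrors the fillna(cData.median()) call above
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_toy)
print(X_filled)
```

Each NaN is replaced by its column's median (2.0 in the first column, 5.0 in the second).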
# independent variables
X = cData.drop(['class'], axis=1)
# the dependent variable
y = cData[['class']]
sns.pairplot(cData, diag_kind='kde') # to plot density curve instead of histogram on the diag
We should choose for PCA those independent variables that have some linear relationship among them. The four variables pr.axis_aspect_ratio, max.length_aspect_ratio, skewness_about and skewness_about.1 have hardly any linear relation with the other independent variables, so including them in PCA adds little value.
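One way to check for such linear relations before PCA is to inspect the correlation matrix directly rather than eyeballing the pairplot. A sketch on synthetic data (the column names are borrowed from the dataset for illustration only):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two strongly correlated features and one
# nearly independent one, mimicking the pattern seen in the pairplot
rng = np.random.default_rng(1)
a = rng.normal(size=400)
df = pd.DataFrame({
    "scaled_variance": a,
    "scaled_variance.1": a * 0.9 + rng.normal(scale=0.2, size=400),
    "skewness_about": rng.normal(size=400),  # hardly any linear relation
})

corr = df.corr()
print(corr.round(2))
```

Variables whose row in the correlation matrix is near zero everywhere contribute little shared variance for PCA to compress.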
#from sklearn.model_selection import train_test_split ## already imported at the start with other libraries
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
X_train.shape
#from sklearn.svm import SVC ## already imported at the start with other libraries
svc = SVC(gamma='auto')
svc.fit(X_train, y_train.values.ravel())
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
#The model overfits substantially with a perfect score on the training set and only 65% accuracy on the test set.
# SVM requires all the features to be on a similar scale. We need to rescale the data so that all features are
# approximately on the same scale, and then check the performance again
# Standardize using the training set's statistics only, so no information
# from the test set leaks into the scaling (applying zscore to each set
# separately would use the test set's own mean and std)
scaler = StandardScaler()
X_trainScaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_testScaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
svc = SVC(gamma='auto')
svc.fit(X_trainScaled, y_train.values.ravel())
print("Accuracy on training set: {:.2f}".format(svc.score(X_trainScaled, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_testScaled, y_test)))
Scaling the data made a huge difference. Training and test set performance are now quite similar and close to 100% accuracy, so there is no need to try different C values.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only: reuse the training set's min/max
svc = SVC(gamma='auto')
svc.fit(X_train_scaled, y_train.values.ravel())
print("Accuracy on training set: {:.2f}".format(svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test_scaled, y_test)))
Scaling the data with MinMaxScaler also made a huge difference. But now we are actually in an underfitting regime, where training and test set performance are quite similar but further from 100% accuracy. From here, we can try increasing either C or gamma to fit a more complex model.
svc = SVC(C=1000,gamma='auto')
svc.fit(X_train_scaled, y_train.values.ravel())
print("Accuracy on training set: {:.3f}".format(
svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
Here, increasing C allows us to improve the model, resulting in 88.6% training-set accuracy.
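Rather than trying individual C values by hand, a cross-validated grid search over C and gamma automates the tuning. The sketch below uses make_classification as a stand-in for the scaled vehicle features, so the exact best parameters are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic 3-class, 18-feature data standing in for the vehicle features
X_demo, y_demo = make_classification(n_samples=400, n_features=18,
                                     n_informative=8, n_classes=3,
                                     random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=1)
scaler = MinMaxScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)  # transform only: reuse train min/max

# Cross-validated search over a small C/gamma grid
grid = GridSearchCV(SVC(), {"C": [1, 10, 100, 1000],
                            "gamma": ["auto", "scale", 0.1]}, cv=5)
grid.fit(X_tr_s, y_tr)
print(grid.best_params_, round(grid.score(X_te_s, y_te), 3))
```

The same loop applied to the scaled vehicle data would confirm whether C=1000 is actually the best setting or just a reasonable one.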
#from sklearn.model_selection import KFold
#from sklearn.model_selection import cross_val_score
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)  # shuffle=True is required when random_state is set
model = SVC(gamma='auto')
results = cross_val_score(model, X, y.values.ravel(), cv=kfold)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
XScaled=X.apply(zscore)
XScaled.head()
covMatrix = np.cov(XScaled,rowvar=False)
print(covMatrix)
pca = PCA(n_components=18)
pca.fit(XScaled)
The eigenvalues
print(pca.explained_variance_)
The eigenvectors
print(pca.components_)
And the percentage of variation explained by each eigenvector
print(pca.explained_variance_ratio_)
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('Principal component')
plt.show()
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cumulative of variation explained')
plt.xlabel('Principal component')
plt.show()
Now 7 dimensions seems very reasonable: with 7 components we can explain over 95% of the variation in the original data!
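scikit-learn can also pick the number of components automatically: passing a float in (0, 1) as n_components keeps just enough components to reach that fraction of explained variance, instead of reading the elbow off the plot. A sketch on synthetic low-rank data (not the vehicle features):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data: 18 observed features driven by 5 latent factors
rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 5))
mix = rng.normal(size=(5, 18))
X_demo = latent @ mix + rng.normal(scale=0.1, size=(300, 18))

# A float n_components keeps just enough components for that fraction
# of variance -- here 95%, matching the threshold used above
pca95 = PCA(n_components=0.95)
X_reduced = pca95.fit_transform(X_demo)
print(pca95.n_components_, X_reduced.shape)
```

On the z-scored vehicle data, PCA(n_components=0.95) would be expected to land on the same 7 components chosen by inspection.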
pca7 = PCA(n_components=7)
pca7.fit(XScaled)
print(pca7.components_)
print(pca7.explained_variance_ratio_)
Xpca7 = pca7.transform(XScaled)
Xpca7
sns.pairplot(pd.DataFrame(Xpca7))
The new 7 variables look independent of one another.
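This can be verified numerically rather than visually: principal-component scores are uncorrelated by construction, so the off-diagonal entries of their correlation matrix should be essentially zero. A sketch on synthetic correlated data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Build deliberately correlated features, then project onto components
rng = np.random.default_rng(3)
base = rng.normal(size=(500, 3))
X_demo = np.hstack([base, base @ rng.normal(size=(3, 4))])
X_demo = X_demo + rng.normal(scale=0.1, size=X_demo.shape)

scores = PCA(n_components=4).fit_transform(X_demo)
corr = np.corrcoef(scores, rowvar=False)

# Off-diagonal correlations between principal components are ~0
off_diag = corr - np.diag(np.diag(corr))
print(np.abs(off_diag).max())
```

Running np.corrcoef on Xpca7 the same way gives the quantitative counterpart of the pairplot above.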
X_train, X_test, y_train, y_test = train_test_split(Xpca7, y, test_size=.30, random_state=1)
svc = SVC(gamma='auto')
svc.fit(X_train, y_train.values.ravel())
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)  # shuffle=True is required when random_state is set
model = SVC(gamma='auto')
results = cross_val_score(model, Xpca7, y.values.ravel(), cv=kfold)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
By reducing the dimensionality by 11 (from 18 variables to 7 principal components), we lost only around 1% accuracy on the training data and 4% on the test data. That makes the reduction easy to justify when comparing the two support vector machines: one trained on the raw data and the other on the reduced principal components.
For K-fold cross-validation, the score improved dramatically, from 50.959% (6.044%) to 92.905% (2.066%), between the SVM trained on raw data and the one trained on the principal components.
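A cleaner way to run this comparison is to put scaling and PCA inside a Pipeline, so both preprocessing steps are re-fit on each training fold and no information leaks from the validation folds into the cross-validation scores. A sketch on synthetic data standing in for the vehicle features:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class, 18-feature stand-in for the vehicle data
X_demo, y_demo = make_classification(n_samples=500, n_features=18,
                                     n_informative=7, n_classes=3,
                                     random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Scaling (and PCA) inside the pipeline are re-fit per fold
raw = make_pipeline(StandardScaler(), SVC(gamma="auto"))
pca = make_pipeline(StandardScaler(), PCA(n_components=7), SVC(gamma="auto"))

for name, model in [("raw", raw), ("pca7", pca)]:
    scores = cross_val_score(model, X_demo, y_demo, cv=kfold)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Substituting X and y.values.ravel() from the notebook for the synthetic arrays would repeat the comparison above without the leakage caused by z-scoring the full dataset before cross-validation.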